09. Random Forest Hyperparameters

Random Forest Hyperparameters

You may have noticed that the values of several hyperparameters were set to default values when we instantiated the Random Forest model in the last exercise. Let's discuss a few of these hyperparameters and learn how they influence the model.

We've seen a few of the Random Forest hyperparameters before because they are also hyperparameters of the individual decision trees in the forest.

min_samples_leaf

As before, this is the minimum number of observations allowed at a leaf. Setting this hyperparameter keeps the algorithm from further splitting nodes with very few observations.

min_samples_split

As before, this is the minimum number of observations required to be at node before it can be split. Setting this hyperparameter keeps the algorithm from further splitting nodes with very few observations.

However, as stated earlier, this hyperparameter does not actually prevent very small leaf nodes from being created. If a node has at least min_samples_split observations, then it can be split, and this split can result in a leaf with fewer than min_samples_split observations.

max_features

This sets the maximum number of features to evaluate when randomly sampling features at each split and deciding which feature to use to create the next split. The default value is the square root of the total number of features in the dataset.

In fact this is also a hyperparameter of the single decision tree classes, so it's possible to randomly choose subsets of features to evaluate at each split even when growing a single decision tree.

n_estimators

This is not a hyperparameter of individual decision trees because it's only applicable when growing forests—this is the number of trees to grow in the forest.

oob_score

This is a boolean hyperparameter that you set to True if you want the out-of-bag score to be calculated as an estimate of out-of-sample accuracy.

bootstrap

This is a boolean hyperparameter that sets whether or not bootstrap samples are used to grow the trees. If False , the entire original dataset is used to grow each tree.

n_jobs

This parameter allows you to use parallel threads to perform some parts of the algorithm's computations. Set n_jobs = -1 to use all available CPUs. Most often, parallelism happens in fitting, but sometimes, as for random forests, it happens during prediction.